Abstract

We prove general exponential moment inequalities for averages of [0,1]-valued
iid random variables and use them to tighten the PAC Bayesian Theorem. The
logarithmic dependence on the sample count in the numerator of the PAC
Bayesian bound is halved.



1 Introduction 

The relative entropy or Kullback Leibler divergence of a Bernoulli variable with 
bias p to a Bernoulli variable with bias q is given by 

$$KL(p,q) = p \ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}.$$

Suppose that $X = (X_1, \dots, X_n)$ is a vector of iid random variables, $0 \le X_i \le 1$,
$E[X_i] = \mu$, and let $M(X) = (1/n)\sum_i X_i$ be the arithmetic mean. We will derive
the following inequality, valid for $n \ge 8$:

$$E\left[e^{nKL(M(X),\mu)}\right] \le 2\sqrt{n}. \tag{1}$$



We also show that the square root on the right side gives the optimal order
in $n$ because for nontrivial Bernoulli ({0,1}-valued) variables $X_i$ we have the
additional inequality, valid for $n \ge 2$,

$$\sqrt{n} \le E\left[e^{nKL(M(X),\mu)}\right]. \tag{2}$$



We will also see that for nontrivial Bernoulli variables the right side of (2) is
independent of $\mu$, so that the expectation $E[e^{nKL(M(X),\mu)}]$ is the same for all
such variables and depends only on $n$.
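As a numerical illustration of this remark, the expectation can be evaluated exactly for Bernoulli variables by summing over the binomial distribution of the number of ones. The following is a minimal sketch, not part of the original argument; the function names `kl_bernoulli` and `exp_moment` are chosen here for illustration only.

```python
import math

def kl_bernoulli(p, q):
    """KL(p, q) between Bernoulli biases, with 0 * ln(0) = 0; assumes 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def exp_moment(n, mu):
    """Exact E[exp(n * KL(M(X), mu))] for n iid Bernoulli(mu) variables."""
    return sum(
        math.comb(n, k) * mu**k * (1 - mu) ** (n - k)
        * math.exp(n * kl_bernoulli(k / n, mu))
        for k in range(n + 1)
    )

# The value should not depend on mu and should lie between sqrt(n) and 2*sqrt(n).
for mu in (0.1, 0.5, 0.9):
    print(mu, exp_moment(20, mu), math.sqrt(20), 2 * math.sqrt(20))
```

For a fixed n the printed values in the second column should agree up to rounding, whichever bias is used.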

It is likely that the inequalities (1) and (2) are known. The upper bound (1)
can be applied to improve on the PAC-Bayesian Theorem (see e.g. [9], [11], [15])
in learning theory: Suppose one has a set of data $Z$ with probability measure
$D$ and a set $\mathcal{H}$ of hypotheses $h : Z \to [0,1]$ (this already includes the usual
loss function). Suppose further that there is a ('prior') probability measure $P$
on $\mathcal{H}$ (assume $Z$ and $\mathcal{H}$ to be finite to avoid questions of measurability). Then
for any $\delta > 0$, with probability greater than $1 - \delta$ a sample $S = (Z_1, \dots, Z_n) \in Z^n$
is drawn from $D^n$ such that for all ('posterior') probability measures $Q$ on $\mathcal{H}$
we have for $n \ge 2$



$$KL\big(E_{h\sim Q}[M(h(S))],\; E_{h\sim Q}[E_{z\sim D}[h(z)]]\big) \le \frac{KL(Q,P) + \ln\frac{1}{\delta} + \ln(2n)}{n-1}. \tag{3}$$

Here $h(S)$ refers to the vector $h(S) = (h(Z_1), \dots, h(Z_n))$, so that $M(h(S))$ is
the empirical loss of the hypothesis $h$. The expression $KL(Q,P)$ refers to the
relative entropy of the probability measures Q and P (see [H])- The importance 
to learning theory comes from the fact that $Q$ may depend on $S$. Note that (3)
implies

$$E_{h\sim Q}[E_{z\sim D}[h(z)]] \le \sup\left\{\epsilon : KL\big(E_{h\sim Q}[M(h(S))],\, \epsilon\big) \le \frac{KL(Q,P) + \ln\frac{1}{\delta} + \ln(2n)}{n-1}\right\},$$

which can drive a learning algorithm to select a posterior $Q$ by minimizing the
sample-dependent right side. Among other applications ([7], [3]) the PAC
Bayesian bound has been applied to prove generalisation error bounds for large
margin classifiers such as support vector machines ([S], [lip. 
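To turn a bound on the KL term into an explicit bound on the expected loss, the supremum over $\epsilon$ has to be evaluated numerically. Since $KL(p, \epsilon)$ is increasing in $\epsilon$ on $[p, 1)$, bisection is one simple way to do this. The sketch below is only an illustration under that assumption; `kl_upper_inverse` and its tolerance parameter are names introduced here, not taken from the paper.

```python
import math

def kl_bernoulli(p, q):
    """KL(p, q) between Bernoulli biases, with 0 * ln(0) = 0; assumes 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def kl_upper_inverse(p_hat, c, tol=1e-12):
    """Approximate sup{eps in [p_hat, 1) : KL(p_hat, eps) <= c} by bisection."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p_hat, mid) <= c:
            lo = mid   # mid is still feasible, move the lower end up
        else:
            hi = mid   # mid violates the constraint, move the upper end down
    return lo

# Illustrative numbers only: empirical Gibbs loss 0.1 and a right-hand side of 0.05.
print(kl_upper_inverse(0.1, 0.05))
```

In use, `p_hat` would be the empirical loss $E_{h\sim Q}[M(h(S))]$ and `c` the right side of the PAC-Bayesian inequality for the chosen $Q$, $\delta$ and $n$.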

The right side of (3) has, with an overall factor of $1/(n-1)$, three terms:
There is the relative entropy $KL(Q,P)$, which can be interpreted as the
information gain in specializing from $P$ to $Q$, an information normally extracted
from the sample $S$. The term $\ln(1/\delta)$ expresses the usual dependence on the
confidence parameter $\delta$, but the remaining $\ln(2n)$ is difficult to understand:
Why do we need it, can't it be altogether eliminated or at least reduced?

We do not know the answer to the first two questions, but using (1) we can
essentially cut the term in half, replacing $\ln(2n)$ by $\ln(2\sqrt{n})$ for $n \ge 8$ and
reduce the overall factor to $1/n$. Our substitute for (3) then reads

$$KL\big(E_{h\sim Q}[M(h(S))],\; E_{h\sim Q}[E_{z\sim D}[h(z)]]\big) \le \frac{KL(Q,P) + \ln\frac{1}{\delta} + \ln(2\sqrt{n})}{n}. \tag{4}$$

Our improvement is not spectacular, but significant when viewed in terms
of the confidence parameter $\delta$. It gives a slightly smaller generalisation error
bound (factor $(n-1)/n$) than (3) with a failure probability $\delta$ decreased by the
factor $1/\sqrt{n}$. For example, if $n = 10000$ and (3) gives a fixed bound with a
failure probability of 1/100, our result will give the same bound with failure
probability less than 1/10000.
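The arithmetic behind this example can be checked directly. The sketch below evaluates the right sides of the two bounds, written out as $(KL(Q,P) + \ln(1/\delta) + \ln(2n))/(n-1)$ for the original bound and $(KL(Q,P) + \ln(1/\delta) + \ln(2\sqrt{n}))/n$ for the improved one. The function names and the value used for $KL(Q,P)$ are illustrative assumptions.

```python
import math

def rhs_original(kl_qp, delta, n):
    """(KL(Q,P) + ln(1/delta) + ln(2n)) / (n - 1), stated for n >= 2."""
    return (kl_qp + math.log(1 / delta) + math.log(2 * n)) / (n - 1)

def rhs_improved(kl_qp, delta, n):
    """(KL(Q,P) + ln(1/delta) + ln(2*sqrt(n))) / n, stated for n >= 8."""
    if n < 8:
        raise ValueError("the improved bound is stated for n >= 8")
    return (kl_qp + math.log(1 / delta) + math.log(2 * math.sqrt(n))) / n

n = 10000
kl_qp = 5.0  # arbitrary illustrative value of KL(Q, P)
old = rhs_original(kl_qp, 1e-2, n)   # failure probability 1/100
new = rhs_improved(kl_qp, 1e-4, n)   # failure probability 1/10000
print(old, new)
```

With these inputs the two numerators coincide, because $\ln(100) + \ln(20000) = \ln(10000) + \ln(200)$, so the improved right side is smaller by exactly the factor $(n-1)/n$.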

It is possible to prove bounds similar to the above (see and 0), where 
the $\ln(n)$ dependence is replaced by $\ln(\ln(n))$ or eliminated altogether, at
the expense of multiplying $KL(Q,P)$ with a constant larger than unity. The
relative entropy $KL(Q,P)$ however is dependent on the posterior $Q$ and thus
implicitly on the sample and the sample size $n$. In all cases where $KL(Q,P)$
grows faster than logarithmically in $n$ (the generic case in machine learning)
these bounds will therefore be weaker than the bound above.

We will prove the principal bounds (1) and (2) in section 2. We will then
apply them to the PAC Bayesian Theorem in section 3.



2 Main Inequalities 

Throughout this note $X_1, \dots, X_n$ are assumed to be iid random variables with
values in $[0,1]$ and expectation $E[X_i] = \mu$. We use $X$ to denote the corresponding
random vector $X = (X_1, \dots, X_n)$ with values in $[0,1]^n$ and $M(X)$ to denote
its arithmetic mean

$$M(X) = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

For any $[0,1]$-valued random variable $X$ use $X'$ to denote the unique Bernoulli
({0,1}-valued) random variable with $\Pr\{X' = 1\} = E[X'] = E[X]$. Evidently
$X'' = X'$, $\forall X$. For $X = (X_1, \dots, X_n)$ we denote $X' = (X'_1, \dots, X'_n)$.
We restate our principal bounds in a slightly more general way.

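Theorem 1 For all $n \ge 2$

$$E\left[e^{nKL(M(X),\mu)}\right] \le E\left[e^{nKL(M(X'),\mu)}\right] \le e^{\frac{1}{12n}}\sqrt{\frac{\pi n}{2}} + 2. \tag{5}$$

If the $X_i$ are nontrivial Bernoulli variables (i.e. if $\mu \in (0,1)$) then there is a
sequence $c_n$ such that $1 \le c_n \to \pi$ as $n \to \infty$ and

$$e^{-\frac{1}{6}}\, c_n \sqrt{\frac{n}{2\pi}} + 2 \le E\left[e^{nKL(M(X),\mu)}\right]. \tag{6}$$

In this case the expectation on the right is independent of $\mu$.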



The right side of (5) is bounded above by $2\sqrt{n}$ for $n \ge 8$ and the left side of
(6) is bounded below by $\sqrt{n}$ for $n \ge 2$, thus giving the simpler bounds (1) and
(2) of the introduction.
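These bounds can be compared numerically with the exact Bernoulli expectation $\sum_{k=0}^{n}\binom{n}{k}(k/n)^k((n-k)/n)^{n-k}$. The sketch below assumes the upper bound has the form $e^{1/(12n)}\sqrt{\pi n/2} + 2$, the lower bound the form $e^{-1/6} c_n \sqrt{n/(2\pi)} + 2$, and $c_n = \sum_{k=1}^{n-1} 1/\sqrt{k(n-k)}$. The function names are illustrative only.

```python
import math

def exact_moment(n):
    """Exact E[exp(n KL(M(X), mu))] for nontrivial Bernoulli X_i (no mu dependence)."""
    return sum(
        math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
        for k in range(n + 1)
    )

def c(n):
    """The sequence c_n = sum_{k=1}^{n-1} 1 / sqrt(k (n - k))."""
    return sum(1 / math.sqrt(k * (n - k)) for k in range(1, n))

def upper_bound(n):
    """Assumed form of the upper bound: exp(1/(12n)) * sqrt(pi n / 2) + 2."""
    return math.exp(1 / (12 * n)) * math.sqrt(math.pi * n / 2) + 2

def lower_bound(n):
    """Assumed form of the lower bound: exp(-1/6) * c_n * sqrt(n / (2 pi)) + 2."""
    return math.exp(-1 / 6) * c(n) * math.sqrt(n / (2 * math.pi)) + 2

for n in (2, 8, 50, 100):
    print(n, lower_bound(n), exact_moment(n), upper_bound(n), 2 * math.sqrt(n))
```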

To prove Theorem 1 we need some auxiliary results. The first is Stirling's
Formula:

Theorem 2 For $n \in \mathbb{N}$

$$n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^{n} e^{g(n)} \tag{7}$$

with $\frac{1}{12n+1} < g(n) < \frac{1}{12n}$.

For a proof see e.g. We will use this Theorem in form of the following
inequalities

$$\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n} < n! < \sqrt{2\pi n}\left(\frac{n}{e}\right)^{n} e^{\frac{1}{12n}}. \tag{8}$$
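The two-sided form of Stirling's formula is easy to check for small $n$. The sketch below computes the exponent $g(n) = \ln n! - \ln\big(\sqrt{2\pi n}\,(n/e)^n\big)$ and tests it against the standard bounds $1/(12n+1) < g(n) < 1/(12n)$. The function name is illustrative, and the range of $n$ is kept small so that floating-point rounding does not blur the comparison.

```python
import math

def stirling_exponent(n):
    """g(n) defined by n! = sqrt(2 pi n) * (n / e)**n * exp(g(n))."""
    log_factorial = math.log(math.factorial(n))
    log_stirling = 0.5 * math.log(2 * math.pi * n) + n * (math.log(n) - 1)
    return log_factorial - log_stirling

for n in (1, 2, 5, 10, 50):
    g = stirling_exponent(n)
    print(n, g, 1 / (12 * n + 1) < g < 1 / (12 * n))
```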



The following simple lemma shows that the expectation of a convex function 
of iid variables can always be bounded by the expectation of the corresponding 
Bernoulli variables. 

Lemma 3 Suppose that $f : [0,1]^n \to \mathbb{R}$ is convex. Then

$$E[f(X)] \le E[f(X')].$$

If $f$ is permutation symmetric in its arguments and $\theta(k)$ denotes the vector
$\theta(k) = (1, \dots, 1, 0, \dots, 0)$ in $\{0,1\}^n$, whose first $k$ coordinates are 1 and whose
remaining $n - k$ coordinates are zero, we also have

$$E[f(X')] = \sum_{k=0}^{n} \binom{n}{k} (1-\mu)^{n-k} \mu^{k} f(\theta(k)).$$

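Proof. A straightforward argument by induction shows that we can write any
point $x = (x_1, \dots, x_n) \in [0,1]^n$ as a convex combination of the extreme points
$\eta = (\eta_1, \dots, \eta_n) \in \{0,1\}^n$ of $[0,1]^n$ in the following way:

$$x = \sum_{\eta \in \{0,1\}^n} \left(\prod_{i:\eta_i=0} (1 - x_i) \prod_{i:\eta_i=1} x_i\right) \eta.$$

Convexity of $f$ therefore implies

$$f(x) \le \sum_{\eta \in \{0,1\}^n} \left(\prod_{i:\eta_i=0} (1 - x_i) \prod_{i:\eta_i=1} x_i\right) f(\eta),$$

with equality if $x \in \{0,1\}^n$. Taking the expectation and using independence
and $E[X_i] = \mu$ we get

$$E[f(X)] \le \sum_{\eta \in \{0,1\}^n} \left(\prod_{i:\eta_i=0} (1 - \mu) \prod_{i:\eta_i=1} \mu\right) f(\eta).$$

This becomes an equality if $X$ is Bernoulli, for then $X$ takes values only in
$\{0,1\}^n$. In particular $E[f(X)] \le E[f(X')]$, which gives the first assertion. If $f$
is permutation symmetric then $f(\eta) = f(\theta(|\{i : \eta_i = 1\}|))$ and we can rewrite
the sum above as

$$\sum_{\eta \in \{0,1\}^n} (1-\mu)^{|\{i:\eta_i=0\}|} \mu^{|\{i:\eta_i=1\}|} f(\theta(|\{i : \eta_i = 1\}|)) = \sum_{k=0}^{n} \binom{n}{k} (1-\mu)^{n-k} \mu^{k} f(\theta(k)).$$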
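The lemma can be checked by brute force on a small example. The sketch below takes the convex, permutation-symmetric function $f(x) = \exp(n\,KL(\text{mean}(x), \mu))$, a three-point distribution on $[0,1]$ with mean $\mu$, and its Bernoulli counterpart, and enumerates all outcomes. The particular distribution, the choice $n = 4$ and all function names are assumptions made only for this illustration.

```python
import itertools
import math

def kl_bernoulli(p, q):
    """KL(p, q) between Bernoulli biases, with 0 * ln(0) = 0; assumes 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def f(x, mu):
    """Convex, permutation-symmetric function exp(n * KL(mean(x), mu))."""
    n = len(x)
    return math.exp(n * kl_bernoulli(sum(x) / n, mu))

def expectation(values, probs, n, mu):
    """Exact E[f(X)] for n iid draws from a finite distribution on [0, 1]."""
    total = 0.0
    for idx in itertools.product(range(len(values)), repeat=n):
        weight = math.prod(probs[i] for i in idx)
        total += weight * f([values[i] for i in idx], mu)
    return total

def binomial_formula(n, mu):
    """sum_k C(n, k) (1 - mu)^(n - k) mu^k f(theta(k))."""
    return sum(
        math.comb(n, k) * (1 - mu) ** (n - k) * mu**k
        * f([1.0] * k + [0.0] * (n - k), mu)
        for k in range(n + 1)
    )

n, mu = 4, 0.4
e_x = expectation([0.0, 0.5, 1.0], [0.4, 0.4, 0.2], n, mu)   # X has mean 0.4
e_x_prime = expectation([0.0, 1.0], [1 - mu, mu], n, mu)     # Bernoulli X'
print(e_x, e_x_prime, binomial_formula(n, mu))
```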
The next lemma is concerned with a series which can be viewed as a Riemann
sum approximating an instance of the Beta-function.

Lemma 4 For $n \ge 2$ the sequence

$$c_n = \sum_{k=1}^{n-1} \frac{1}{\sqrt{k(n-k)}}$$

satisfies $1 \le c_n \le \pi$, and $c_n \to \pi$ as $n \to \infty$.
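Proof. Define a function $\psi$ on $(0,1)$ by

$$\psi(t) = \frac{1}{\sqrt{t(1-t)}}.$$

The change of variables $t \to \cos^2\theta$ shows that

$$\int_0^1 \psi(t)\,dt = \pi.$$

It follows from elementary calculus that $\psi$ has a unique minimum at $t = 1/2$
with minimal value 2. This implies that $1/\sqrt{k(n-k)} \ge 2/n$ and therefore
$c_n \ge 2(n-1)/n \ge 1$ for $n \ge 2$. It also implies that the functions $\psi_n$ defined
on $(0,1)$ by

$$\psi_n(t) = \begin{cases}
\psi\left(\frac{k}{n}\right) & \text{if } t \in \left[\frac{k-1}{n}, \frac{k}{n}\right) \text{ and } k \le n/2 \\
0 & \text{if } t \in \left[\frac{k-1}{n}, \frac{k}{n}\right) \text{ and } k - 1 \le n/2 < k \\
\psi\left(\frac{k-1}{n}\right) & \text{if } t \in \left[\frac{k-1}{n}, \frac{k}{n}\right) \text{ and } n/2 < k - 1
\end{cases}$$

satisfy $\psi_n \le \psi$. Since

$$c_n = \sum_{k=1}^{n-1} \frac{1}{\sqrt{k(n-k)}} = \sum_{k=1}^{n-1} \frac{1}{n} \frac{1}{\sqrt{\frac{k}{n}\left(1 - \frac{k}{n}\right)}} = \int_0^1 \psi_n(t)\,dt,$$

this implies that $c_n \le \pi$. Also $\psi_n \to \psi$ a.e. so that by dominated convergence

$$c_n = \int_0^1 \psi_n(t)\,dt \to \int_0^1 \psi(t)\,dt = \pi.$$

■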



Proof of Theorem 1. If $X_i$ is trivial (i.e. if $\mu \in \{0,1\}$) (5) is evident, so we
can assume $\mu \in (0,1)$. Define

$$f : x \in [0,1]^n \mapsto \exp\left(n\,KL\left(\frac{1}{n}\sum_{i=1}^{n} x_i,\ \mu\right)\right).$$

Since the average is linear and KL is convex (see [5]) and the exponential
function is nondecreasing and convex, the function $f$ is also convex. $f$ is clearly
permutation symmetric in its arguments. Lemma 3 immediately gives

$$E\left[e^{nKL(M(X),\mu)}\right] \le E\left[e^{nKL(M(X'),\mu)}\right] = \sum_{k=0}^{n} \binom{n}{k} (1-\mu)^{n-k} \mu^{k} f(\theta(k)), \tag{9}$$

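establishing also the first inequality in (5). Using the special form of the function
$f$ we find

$$f(\theta(k)) = \exp\left(n\,KL\left(\frac{k}{n},\ \mu\right)\right) = \left(\frac{n-k}{n(1-\mu)}\right)^{n-k} \left(\frac{k}{n\mu}\right)^{k}.$$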



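Substitution in (9) leads to cancellation of the dependence in $\mu$ (proving the
last statement of the theorem) and gives

$$E\left[e^{nKL(M(X'),\mu)}\right] = \sum_{k=0}^{n} \binom{n}{k} \left(\frac{n-k}{n}\right)^{n-k} \left(\frac{k}{n}\right)^{k} = \frac{n!}{n^n} \sum_{k=1}^{n-1} \frac{k^k (n-k)^{n-k}}{k!\,(n-k)!} + 2.$$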

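Using Stirling's formula (8) and Lemma 4 on the last expression we obtain

$$\begin{aligned}
E\left[e^{nKL(M(X'),\mu)}\right]
&< \frac{\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n} e^{\frac{1}{12n}}}{n^n} \sum_{k=1}^{n-1} \frac{k^k (n-k)^{n-k}}{\sqrt{2\pi k}\left(\frac{k}{e}\right)^{k} \sqrt{2\pi (n-k)}\left(\frac{n-k}{e}\right)^{n-k}} + 2 \\
&= e^{\frac{1}{12n}} \sqrt{\frac{n}{2\pi}} \sum_{k=1}^{n-1} \frac{1}{\sqrt{k(n-k)}} + 2 \\
&\le e^{\frac{1}{12n}} \sqrt{\frac{\pi n}{2}} + 2,
\end{aligned}$$

which gives (5). Similarly

$$E\left[e^{nKL(M(X'),\mu)}\right] \ge e^{-\frac{1}{6}} \sqrt{\frac{n}{2\pi}} \sum_{k=1}^{n-1} \frac{1}{\sqrt{k(n-k)}} + 2 = e^{-\frac{1}{6}} \sqrt{\frac{n}{2\pi}}\, c_n + 2,$$

which gives (6) for Bernoulli variables.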



3 Application to the PAC-Bayesian Theorem 

Consider an unknown probability distribution $D$ on a set $Z$, and a set $\mathcal{H}$ of
hypotheses $h : Z \to [0,1]$ (includes the loss function). To avoid a discussion
of measurability $Z$ and $\mathcal{H}$ are both assumed to be finite: Their cardinality
is otherwise irrelevant and will not appear in our results. The sample $S =
(Z_1, \dots, Z_n)$ is a $Z^n$-valued random vector drawn from $\Pr = D^n$. For $h \in \mathcal{H}$ we
use $h(S)$ to denote the $[0,1]$-valued random vector $h(S) = (h(Z_1), \dots, h(Z_n))$.
We write

$$h(D) = E_{z\sim D}[h(z)] \quad\text{and}\quad M(h(S)) = \frac{1}{n}\sum_{i=1}^{n} h(Z_i).$$



If $Q$ is a probability measure on $\mathcal{H}$, we write

$$Q(D) = E_{h\sim Q}[h(D)] \quad\text{and}\quad Q(S) = E_{h\sim Q}[M(h(S))].$$



The relative entropy of two probability measures Q and P on a set H, denoted 
KL (Q, P) , is defined to be infinite if Q is not absolutely continuous w.r.t. P. 

Otherwise, if $\frac{dQ}{dP}$ is the density of $Q$ w.r.t. $P$, we set

$$KL(Q,P) = E_Q\left[\ln\frac{dQ}{dP}\right].$$

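Theorem 5 We have for any probability distribution $P$ on $\mathcal{H}$, for $n \ge 8$ and
$\forall \delta > 0$

$$\Pr_S\left\{\exists Q : KL(Q(S), Q(D)) > \frac{KL(Q,P) + \ln\frac{1}{\delta} + \ln(2\sqrt{n})}{n}\right\} \le \delta. \tag{10}$$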



Proof. For every hypothesis $h \in \mathcal{H}$, applying the bound (1) to the random
vector $h(S)$ gives

$$E_S\left[e^{nKL(M(h(S)),h(D))}\right] \le 2\sqrt{n}.$$



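Let $S \mapsto Q_S$ be any map from samples to the probability distributions on $\mathcal{H}$ (a
learning algorithm for Gibbs classifiers). Using Jensen's inequality, convexity of
the KL-divergence and of the exponential function, we have

$$\begin{aligned}
E_S\left[\exp\big(nKL(Q_S(S), Q_S(D)) - KL(Q_S,P)\big)\right]
&\le E_S\left[\exp\left(E_{h\sim Q_S}\left[nKL(M(h(S)),h(D)) - \ln\frac{dQ_S}{dP}(h)\right]\right)\right] \\
&\le E_S\left[E_{h\sim Q_S}\left[\exp\left(nKL(M(h(S)),h(D)) - \ln\frac{dQ_S}{dP}(h)\right)\right]\right] \\
&= E_S\left[E_{h\sim P}\left[e^{nKL(M(h(S)),h(D))} \left(\frac{dQ_S}{dP}(h)\right)^{-1} \frac{dQ_S}{dP}(h)\right]\right] \\
&\le E_{h\sim P}\left[E_S\left[e^{nKL(M(h(S)),h(D))}\right]\right] \\
&\le 2\sqrt{n}.
\end{aligned}$$

Finally, by Markov's inequality,

$$\begin{aligned}
\delta &\ge \Pr_S\left\{e^{nKL(Q_S(S),Q_S(D)) - KL(Q_S,P)} > \frac{2\sqrt{n}}{\delta}\right\} \\
&= \Pr_S\left\{KL(Q_S(S), Q_S(D)) > \frac{KL(Q_S,P) + \ln\frac{1}{\delta} + \ln(2\sqrt{n})}{n}\right\}.
\end{aligned}$$

An appropriate worst-case choice of the function $S \mapsto Q_S$ gives (10). ■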
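The Jensen steps in this proof amount to the change-of-measure inequality $E_{h\sim Q}[g(h)] - KL(Q,P) \le \ln E_{h\sim P}[e^{g(h)}]$, here with $g(h) = nKL(M(h(S)),h(D))$. For a finite hypothesis set it can be checked directly. The sketch below uses made-up numbers for $g$, $P$ and $Q$; the function names are illustrative. It also shows that the gap closes when $Q$ has density proportional to $e^{g}$ with respect to $P$.

```python
import math

def kl_discrete(q, p):
    """KL(Q, P) for two distributions on the same finite set (Q << P assumed)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def change_of_measure_gap(g, q, p):
    """ln E_P[exp(g)] - (E_Q[g] - KL(Q, P)); nonnegative for every Q << P."""
    lhs = sum(qi * gi for qi, gi in zip(q, g)) - kl_discrete(q, p)
    rhs = math.log(sum(pi * math.exp(gi) for pi, gi in zip(p, g)))
    return rhs - lhs

g = [0.3, 1.2, 2.5, 0.1]          # hypothetical values of g(h) on four hypotheses
p = [0.25, 0.25, 0.25, 0.25]      # uniform prior P
q = [0.1, 0.2, 0.6, 0.1]          # an arbitrary posterior Q

# Posterior with density proportional to exp(g) w.r.t. P attains equality.
z = sum(pi * math.exp(gi) for pi, gi in zip(p, g))
q_gibbs = [pi * math.exp(gi) / z for pi, gi in zip(p, g)]

print(change_of_measure_gap(g, q, p), change_of_measure_gap(g, q_gibbs, p))
```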



The loosest step in the proof is the use of Markov's inequality. The lower
bound (2) can be used to show that the other inequalities are rather tight:
Let $Z$ be any large finite set, $\mathcal{H}$ a set of functions $h : Z \to \{0,1\}$ and $D$ a
distribution such that the members of $\mathcal{H}$ all induce nontrivial Bernoulli variables,
i.e. $E_{z\sim D}[h] \in (0,1)$, $\forall h \in \mathcal{H}$. Let $P$ be uniform on $\mathcal{H}$. For a sample $S$ we
define the posterior $Q_S$ by its density w.r.t. $P$:

$$\frac{dQ_S}{dP}(h) = \frac{e^{nKL(M(h(S)),h(D))}}{E_{h\sim P}\left[e^{nKL(M(h(S)),h(D))}\right]}.$$



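Then

$$\psi(h,S) = nKL(M(h(S)),h(D)) - \ln\frac{dQ_S}{dP}(h) = \ln E_{h\sim P}\left[e^{nKL(M(h(S)),h(D))}\right]$$

is independent of $h$. Therefore, with (2),

$$\begin{aligned}
E_S\left[\exp\left(E_{h\sim Q_S}\left[nKL(M(h(S)),h(D)) - \ln\frac{dQ_S}{dP}(h)\right]\right)\right]
&= E_S\left[\exp\left(\ln E_{h\sim P}\left[e^{nKL(M(h(S)),h(D))}\right]\right)\right] \\
&= E_{h\sim P}\left[E_S\left[e^{nKL(M(h(S)),h(D))}\right]\right] \\
&\ge \sqrt{n}.
\end{aligned}$$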

This can be rewritten as the statement: For every $\delta > 0$

$$E_S\left[\exp\left(n E_{h\sim Q_S}\left[KL(M(h(S)),h(D))\right] - \left(KL(Q_S,P) + \ln\frac{1}{\delta} + \ln\sqrt{n}\right)\right)\right] \ge \delta,$$

a very weak lower bound version of the PAC-Bayesian theorem, which nevertheless
shows that an elimination of the $\sqrt{n}$ term, if it is at all possible, would
have to follow a completely different path.